
Fix cuDNN v9 build by replacing removed cuDNN v6 RNN API usage by cuDNN v8 RNN API and reenable RNN tests for CUDA EP #19419

Merged
12 commits merged into microsoft:main on Feb 23, 2024

Conversation

mtavenrath
Contributor

Description

Replace the deprecated cuDNN RNN API with the cuDNN v8 RNN API and re-enable the RNN tests for the CUDA EP.

Motivation and Context

The deprecated cuDNN RNN API might vanish soon, and all RNN tests for the current CUDA EP RNN implementation are disabled due to failures. With this change the deprecated API has been removed, and the updated implementation no longer fails the tests.
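For context, the migration moves from the removed v6-era entry points (cudnnSetRNNDescriptor_v6, cudnnRNNForwardInference/Training) to the v8-style API. A schematic outline of the new call sequence, simplified to pseudocode level (not compiled here; descriptor setup, error checking, and all buffer names are placeholders, with a GRU cell chosen purely as an example):

```cpp
// Schematic cuDNN v8 RNN forward path (placeholders, not a drop-in implementation):
cudnnRNNDescriptor_t rnn_desc;
cudnnCreateRNNDescriptor(&rnn_desc);
cudnnSetRNNDescriptor_v8(rnn_desc, CUDNN_RNN_ALGO_STANDARD, CUDNN_GRU,
                         CUDNN_RNN_DOUBLE_BIAS, CUDNN_UNIDIRECTIONAL,
                         CUDNN_LINEAR_INPUT, CUDNN_DATA_FLOAT, CUDNN_DATA_FLOAT,
                         CUDNN_DEFAULT_MATH, input_size, hidden_size,
                         /*projSize=*/hidden_size, num_layers, dropout_desc,
                         /*auxFlags=*/0);

// Workspace and reserve sizes are queried instead of computed manually.
size_t workspace_bytes = 0, reserve_bytes = 0;
cudnnGetRNNTempSpaceSizes(handle, rnn_desc, CUDNN_FWD_MODE_INFERENCE, x_desc,
                          &workspace_bytes, &reserve_bytes);

// Single forward entry point replacing cudnnRNNForwardInference/Training.
cudnnRNNForward(handle, rnn_desc, CUDNN_FWD_MODE_INFERENCE, dev_seq_lengths,
                x_desc, x, y_desc, y, h_desc, hx, hy, c_desc, cx, cy,
                weight_space_bytes, weight_space, workspace_bytes, workspace,
                reserve_bytes, reserve_space);
```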

@gedoensmax
Contributor

@hariharans29 I believe we talked about some deprecated APIs via mail. Markus took it on to fix this. A review and probably guidance on testing would be much appreciated.

@mtavenrath
Contributor Author

@hariharans29 Can you please trigger the CI again? I accidentally removed a single } during cleanup of my PR.

@mtavenrath
Contributor Author

cuDNN v9.0.0 was released today. It removes the deprecated APIs that this PR replaces, so the CUDA EP of onnxruntime will no longer compile without this PR.

@gedoensmax
Contributor

@pranavsharma for viz due to cuDNN 9 discussions.

@mtavenrath mtavenrath changed the title Replace deprecated cuDNN RNN APIs by new cuDNN v8 APIs and re-enable RNN tests which have been broken before. Fix cuDNN v9 build by replacing removed cuDNN v6 RNN API usage by cuDNN v8 RNN API and reenable RNN tests for CUDA EP Feb 8, 2024
@hariharans29
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@hariharans29
Member

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline

@hariharans29
Member

/azp run Big Models

Azure Pipelines successfully started running 1 pipeline(s).

Azure Pipelines successfully started running 7 pipeline(s).

@mtavenrath
Contributor Author

I've pushed an update which fixes one RNN test, lintrunner issues, and Linux compile warnings. For some reason my Windows build doesn't show those warnings even with /W3.

@hariharans29
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@hariharans29
Member

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline

@hariharans29
Member

/azp run Big Models

Azure Pipelines successfully started running 1 pipeline(s).

Azure Pipelines successfully started running 7 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

@mtavenrath
Contributor Author

The warning behavior on Windows is annoyingly different from the one on Linux. Unused-local-variable warnings (C4189) are supposed to be enabled with /W4, whereas the default for ORT is /W3. Even with /W4 (or #pragma warning(3: 4189)) I'm still not able to trigger this warning.
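As an aside, a minimal reproducer for C4189 looks like the following (illustrative only; whether it actually fires depends on the compiler version and optimization settings, and the compile commands in the comment are assumptions about a plain cl.exe invocation):

```cpp
// Minimal C4189 reproducer. Expected behavior (not verified on every toolchain):
//   cl /W4 repro.cpp          -> C4189 at its default level 4
//   cl /W3 /w34189 repro.cpp  -> C4189 promoted to level 3 via /w<l><nnnn>
// C4189 only fires for a local that is initialized but never read afterwards.
int main() {
  int unused = 42;  // initialized, never referenced
  return 0;
}
```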

@hariharans29
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@hariharans29
Member

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline

@hariharans29
Member

/azp run Big Models

Azure Pipelines successfully started running 1 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 7 pipeline(s).

@hariharans29
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@mtavenrath
Contributor Author

I've found one (potentially random) failing test on Windows. The problem with this test is that I cannot reproduce it on my local system (recent driver, RTX 6000 Ada). Which cuDNN version is being used, what kind of GPU is installed in the test machine, and which driver version is in use?

1: [ OK ] GRUTest.ONNXRuntime_TestGRUOpGrowBatchSequenceLength (48 ms)
1: [ RUN ] GRUTest.ONNXRuntime_TestGRUOpGrowBatchSequenceLengthLinearBeforeReset
1: 2024-02-20 20:04:01.7835208 [E:onnxruntime:Default, cuda_call.cc:118 onnxruntime::CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=b345d2c5c000000 ; file=D:\a_work\1\s\onnxruntime\core\providers\cuda\gpu_data_transfer.cc ; line=73 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyDeviceToHost, static_cast<cudaStream_t>(stream.GetHandle()));
1: 2024-02-20 20:04:01.7838589 [E:onnxruntime:Default, cuda_call.cc:118 onnxruntime::CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=b345d2c5c000000 ; file=D:\a_work\1\s\onnxruntime\core\providers\cuda\cuda_execution_provider.cc ; line=412 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(stream_));
1: D:\a_work\1\s\onnxruntime\test\providers\base_tester.cc(323): error: Expected equality of these values:
1: expect_result
1: Which is: 4-byte object <00-00 00-00>
1: ExpectResult::kExpectFailure
1: Which is: 4-byte object <01-00 00-00>
1: Run failed but expected success: CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=b345d2c5c000000 ; file=D:\a_work\1\s\onnxruntime\core\providers\cuda\gpu_data_transfer.cc ; line=73 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyDeviceToHost, static_cast<cudaStream_t>(stream.GetHandle()));
1: Google Test trace:
1: D:\a_work\1\s\onnxruntime\test\providers\base_tester.cc(791): registered execution providers: CUDAExecutionProvider
1: Stack trace:

@tianleiwu
Contributor

The test is done in A10 GPU with CUDA 11.8 and cuDNN 8.5.0.96 (According to https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements).
@snnn, could you confirm the cuDNN version and driver version?

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 10 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

@mtavenrath
Contributor Author

I was able to reproduce the failure by downgrading to cuDNN 8.5 for CUDA 11.8. Starting with cuDNN 8.9.1 the sequence-length pointer is no longer required, and the one passed here was incorrect in general. I guess most uses of cudnnRNNForward no longer read the sequence-length buffer, except for the single case hit by the failing test.

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Android CI Pipeline

@tianleiwu
Contributor

/azp run iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 2 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 10 pipeline(s).

@tianleiwu
Contributor

/azp run Big Models

Azure Pipelines successfully started running 1 pipeline(s).

@mtavenrath
Contributor Author

All CIs except for iOS succeeded. The iOS failure is unrelated to this PR.

@tianleiwu tianleiwu merged commit efbe2b8 into microsoft:main Feb 23, 2024
80 of 81 checks passed
YUNQIUGUO pushed a commit that referenced this pull request Feb 27, 2024
…NN v8 RNN API and reenable RNN tests for CUDA EP (#19419)

Replace the deprecated cuDNN RNN API with the cuDNN v8 RNN API and
re-enable the RNN tests for the CUDA EP.

### Motivation and Context
The deprecated cuDNN RNN API might vanish soon, and all RNN tests for
the current CUDA EP RNN implementation are disabled due to failures.
With this change the deprecated API has been removed, and the updated
implementation no longer fails the tests.